Abstract. In this note we introduce a new randomized algorithm for counting
triangles in graphs. We show that under mild conditions, the estimate of our
algorithm is strongly concentrated around the true number of triangles.
Specifically, if $p \ge \max\left(\frac{\Delta \log n}{t}, \frac{\log n}{\sqrt{t}}\right)$,
where $n$, $t$, $\Delta$ denote the number of vertices in $G$, the number of
triangles in $G$, and the maximum number of triangles in which an edge of $G$ is
contained, then for any constant $\epsilon > 0$ our unbiased estimate $T$ is concentrated
around its expectation, i.e., $\Pr[|T - E[T]| \ge \epsilon E[T]] = o(1)$. Finally, we present
a MapReduce implementation of our algorithm.



1. Introduction 

Triangle counting is a fundamental algorithmic problem with many applications. 
The interested reader is urged to see [17] and references therein. The fastest exact 
triangle counting algorithm to date (in terms of number of edges in the graph) is
due to Alon, Yuster and Zwick [3] and runs in $O(m^{2\omega/(\omega+1)})$ time, where currently the matrix
multiplication exponent $\omega$ is 2.371 [H]. For planar graphs linear time algorithms
are known, e.g., [18]. Practical methods for exact triangle counting use instead
enumeration techniques, see e.g., [TH] and references therein. For many applications, 
especially in the context of large social networks, an exact count is not crucial but 
rather a fast, high quality estimate. Most of the work on approximate triangle 
counting is sampling-based and has considered a (semi-)streaming setting O El [3 
[HI |23] . A different line of research is based on a linear algebraic approach [U |2T] . 
Currently, to the best of our knowledge, the state-of-the-art approximate counting
method relies on a hybrid algorithm that first sparsifies the graph and then samples
triples according to a degree-based partitioning trick [17].

In this short note, we present a new sampling approach to approximating the
number of triangles in a graph $G(V,E)$, which significantly improves existing
sampling approaches. Furthermore, it is easily implemented in parallel. The key idea
of our algorithm is to correlate the sampling of edges such that if two edges of a
triangle are sampled, the third edge is always sampled. This decreases the degree
of the multivariate polynomial that expresses the number of sampled triangles. We
analyze our method using a powerful theorem due to Hajnal and Szemerédi [11].
This note is organized as follows: in Section 2 we discuss the theoretical
preliminaries for our analysis and in Section 1.1 we present our randomized algorithm. In
Section 3 we present our main theoretical results, we analyze our algorithm and we
discuss some of its important properties. In Section 4 we present an implementation
of our algorithm in the popular MapReduce framework. Finally, in Section 5 we
conclude with future research directions.

Algorithm 1 Colorful Triangle Sampling

Require: Unweighted graph $G([n], E)$
Require: Number of colors $N = 1/p$

Let $f : V \to [N]$ have uniformly random values

$E' \leftarrow \{\{u,v\} \in E \mid f(u) = f(v)\}$

$T \leftarrow$ number of triangles in the graph $(V, E')$

return $T/p^2$



1.1. Algorithm. Our algorithm, summarized as Algorithm 1, samples each edge
with probability $p$, where $N = 1/p$ is an integer, as follows. Let $f : [n] \to [N]$ be a
random coloring of the vertices of $G([n],E)$, such that for all $v \in [n]$ and $i \in [N]$,
$\Pr[f(v) = i] = p$. We call an edge monochromatic if both its endpoints have the
same color. Our algorithm samples exactly the set $E'$ of monochromatic edges,
counts the number $T$ of triangles in $([n], E')$ (using any exact or approximate triangle
counting algorithm), and multiplies this count by $1/p^2$.
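
The sampling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: vertices are labeled 0, ..., n-1, the graph is a list of undirected edges without duplicates, and the names `count_triangles` and `colorful_triangle_estimate` are ours, not the paper's. The exact counter used inside is a plain neighbor-intersection routine; any other exact or approximate counter could be swapped in.

```python
import random


def count_triangles(n, edges):
    """Exact triangle count via neighbor-set intersection (self-loops ignored)."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    total = 0
    for u in range(n):
        for v in adj[u]:
            if v > u:
                # Only common neighbors w > v, so each triangle u < v < w counts once.
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total


def colorful_triangle_estimate(n, edges, num_colors, rng=None):
    """Color vertices uniformly with num_colors = 1/p colors, keep the
    monochromatic edges, count triangles there, and rescale by 1/p^2."""
    rng = rng or random.Random()
    color = [rng.randrange(num_colors) for _ in range(n)]
    mono_edges = [(u, v) for u, v in edges if color[u] == color[v]]
    # p = 1/num_colors, so dividing by p^2 is multiplying by num_colors^2.
    return count_triangles(n, mono_edges) * num_colors ** 2
```

With a single color (p = 1) every edge is kept and the estimate coincides with the exact count; with more colors the returned value is always a multiple of `num_colors ** 2`.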

Previous work [22, 23] has used a related sampling idea, the difference being
that edges were sampled independently with probability $p$. Some intuition for why
this sampling procedure is less efficient than what we propose can be obtained by
considering the case where a graph has $t$ edge-disjoint triangles. With independent
edge sampling there will be no triangles left (with probability $1 - o(1)$) if $p^3 t = o(1)$.
Using our colorful sampling idea there will be $\omega(1)$ triangles in the sample with
probability $1 - o(1)$ as long as $p^2 t = \omega(1)$. This means that we can choose a smaller
sample, and still get accurate estimates from it.
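
To make the comparison concrete, the following Python sketch computes the expected number of surviving triangles under both schemes and simulates them. As a simplifying assumption the simulation uses vertex-disjoint triangles (a special case of edge-disjoint ones), so all triangles are independent; the function names are hypothetical.

```python
import random


def expected_sampled_triangles(t, p):
    """Expected number of surviving triangles out of t edge-disjoint ones."""
    independent = t * p ** 3  # all three edges must be kept independently
    colorful = t * p ** 2     # second and third vertex must match the first color
    return independent, colorful


def simulate_disjoint_triangles(t, num_colors, rng):
    """Sample t vertex-disjoint triangles both ways with p = 1/num_colors."""
    p = 1.0 / num_colors
    # Independent edge sampling: a triangle survives if all 3 edges are kept.
    independent = sum(
        all(rng.random() < p for _ in range(3)) for _ in range(t)
    )
    # Colorful sampling: a triangle survives if its 3 vertices share a color.
    colorful = 0
    for _ in range(t):
        a, b, c = (rng.randrange(num_colors) for _ in range(3))
        colorful += (a == b == c)
    return independent, colorful
```

For instance, with $t = 10^6$ and $p = 0.01$ the expectations are about 1 surviving triangle for independent sampling versus about 100 for colorful sampling.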



2. Theoretical Preliminaries 



In Section 3.2 we make extensive use of the following version of the Chernoff
bound [8].

Lemma 1 (Chernoff Inequality). Let $X_1, X_2, \ldots, X_k$ be independently distributed
$\{0, 1\}$ variables with $E[X_i] = p$. Then for any $\epsilon > 0$, we have

$$\Pr\left[\left|\frac{1}{k}\sum_{i=1}^{k} X_i - p\right| > \epsilon p\right] \le 2e^{-\epsilon^2 p k/2}.$$



Hajnal and Szemerédi [11] proved in 1970 the following conjecture of Paul Erdős:

Theorem 1 (Hajnal-Szemerédi Theorem). Every graph with $n$ vertices and maximum
vertex degree at most $k$ is $(k+1)$-colorable with all color classes of size at least
$\lfloor n/(k+1) \rfloor$.

3. Analysis 

We wish to pick $p$ as small as possible but at the same time have a strong
concentration of the estimate around its expected value. How small can $p$ be? In Section 3.1
we present a second moment argument which gives a sufficient condition for picking
$p$. Our main theoretical result, stated as Theorem 3 in Section 3.2, provides a second
sufficient condition. In Section 3.3 we analyze the complexity of our
method.

3.1. Second Moment Method. Using the Second Moment Method we are able
to obtain the following strong theoretical guarantee:

Theorem 2. Let $n$, $t$, $\Delta$, $T$ denote the number of vertices in $G$, the number of
triangles in $G$, the maximum number of triangles in which an edge of $G$ is contained, and the
number of monochromatic triangles in the randomly colored graph, respectively. Also
let $N = 1/p$ be the number of colors used. If $p \ge \max\left(\frac{\Delta \log n}{t}, \frac{\log n}{\sqrt{t}}\right)$, then $T \sim E[T]$ with
probability $1 - o(1)$.

Proof. By Chebyshev's inequality, if $\mathrm{Var}[T] = o(E[T]^2)$ then $T \sim E[T]$ with
probability $1 - o(1)$ [2]. Let $X_i$ be a random variable for the $i$-th triangle, $i = 1, \ldots, t$,
such that $X_i = 1$ if the $i$-th triangle is monochromatic and $X_i = 0$ otherwise. The number of
monochromatic triangles $T$ is equal to the sum of these indicator variables, i.e., $T = \sum_{i=1}^{t} X_i$.
By the linearity of expectation and by the fact that $\Pr[X_i = 1] = p^2$ we obtain
that $E[T] = p^2 t$. We set $\Lambda = \sum_{i \sim j} \mathrm{Cov}[X_i, X_j]$, where the sum is over ordered
pairs and $i \sim j$ denotes that the corresponding indicator variables are dependent.
It is easy to check that the only case where two indicator variables are dependent is
when they share an edge. In this case the covariance is non-zero and for any $p > 0$,

$$\mathrm{Cov}[X_i, X_j] = p^3 - p^4 < p^3.$$

Hence, we obtain the following upper bound on the variance of $T$, where $\delta_e$ is the
number of triangles in which edge $e$ is contained and $\Delta = \max_{e \in E(G)} \delta_e$. The last
step uses $\sum_e \delta_e = 3t$ and $\delta_e \le \Delta$:

$$\mathrm{Var}[T] \le E[T] + \Lambda \le p^2 t + p^3 \sum_e \delta_e^2 \le p^2 t + 3p^3 t\Delta.$$

We pick $p$ large enough to get $\mathrm{Var}[T] = o(E[T]^2)$. It suffices that

$$p^4 t^2 \gg p^2 t + 3p^3 t\Delta \iff p^2 t \gg 1 + 3p\Delta. \qquad (1)$$

We consider two cases:

• Case 1 ($p\Delta < 1/3$): It suffices that $p^2 t = \omega(n)$ where $\omega(n)$ is some slowly
growing function. We pick $\omega(n) = \log^2 n$ and hence $p \ge \frac{\log n}{\sqrt{t}}$.

• Case 2 ($p\Delta \ge 1/3$): It suffices that $\frac{pt}{\Delta} \ge \log n$, i.e., $p \ge \frac{\Delta \log n}{t}$.

Combining the above two cases we get that if

$$p \ge \max\left(\frac{\Delta \log n}{t}, \frac{\log n}{\sqrt{t}}\right)$$

Equation (1) is satisfied and hence $T \sim E[T]$ with probability $1 - o(1)$.

□
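
As a rough numerical illustration of Theorem 2, the sketch below evaluates the threshold on $p$ and the variance bound from the proof. It assumes natural logarithms (the note does not fix the base) and caps $p$ at 1; the function names are ours.

```python
import math


def theorem2_min_p(n, t, delta):
    """Smallest p satisfying p >= max(delta*log(n)/t, log(n)/sqrt(t)), capped at 1.

    Assumption: natural logarithm; a different base only changes constants.
    """
    log_n = math.log(n)
    return min(1.0, max(delta * log_n / t, log_n / math.sqrt(t)))


def variance_bound(p, t, delta):
    """Upper bound p^2 t + 3 p^3 t delta on Var[T] from the proof of Theorem 2."""
    return p ** 2 * t + 3 * p ** 3 * t * delta
```

For the Enron row of Table 1 ($n = 36{,}692$, $t = 727{,}044$, $\Delta = 420$) the second term of the maximum dominates and the threshold comes out at roughly 0.012, i.e., on the order of 80 colors.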



3.2. Concentration via the Hajnal-Szemeredi Theorem. Here, we present 
a different approach to obtaining concentration, based on partitioning the set of 
triangles/indicator variables in sets containing many independent random indicator 
variables and then taking a union bound. Our theoretical result is the following 
theorem: 

Theorem 3. Let $t_{\max}$ be the maximum number of triangles a vertex $v$ is contained
in. Also, let $n, t, p, T$ be defined as above, $\epsilon$ a small positive constant and $d > 0$ any
constant. If $p^2 \ge \frac{4(d+3)\, t_{\max} \log n}{\epsilon^2 t}$, then $\Pr[|T - E[T]| > \epsilon E[T]] \le n^{-d}$.

Proof. Let $X_i$ be defined as above, $i = 1, \ldots, t$. Construct an auxiliary graph $H$
as follows: add a vertex in $H$ for every triangle in $G$ and connect two vertices
representing triangles $t_1$ and $t_2$ if and only if they have a common vertex. The
maximum degree of $H$ is at most $3t_{\max} = O(\delta^2)$, where $\delta = O(n)$ is the maximum degree in
the graph. Invoke the Hajnal-Szemerédi Theorem on $H$: we can partition the vertices
of $H$ (triangles of $G$) into sets $S_1, \ldots, S_q$ such that $|S_i| = \Omega(t/t_{\max})$ and $q = \Theta(t_{\max})$.
Let $k = t/t_{\max}$. Note that the set of indicator variables $X_j$ corresponding to any set $S_i$
is independent, since the triangles in $S_i$ are pairwise vertex-disjoint. Applying the
Chernoff bound for each set $S_i$, $i = 1, \ldots, q$, we obtain

$$\Pr\left[\left|\frac{1}{|S_i|}\sum_{j \in S_i} X_j - p^2\right| > \epsilon p^2\right] \le 2e^{-\epsilon^2 p^2 k/2}.$$

If $p^2 k \epsilon^2 \ge 4d' \log n$, then $2e^{-\epsilon^2 p^2 k/2}$ is upper bounded by $n^{-d'}$, where $d' > 0$ is a
constant. Since $q = O(n^3)$, by taking a union bound over all sets $S_i$ we see that the
triangle count is approximated within a factor of $\epsilon$ with probability at least $1 - n^{-(d'-3)}$.
Setting $d = d' - 3$ completes the proof. □
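
The construction of the auxiliary graph $H$ can be sketched as follows. Note the loud caveat: the partition step below uses a plain greedy coloring as a stand-in. It does yield classes of pairwise vertex-disjoint triangles, but it is not the equitable coloring guaranteed by the Hajnal-Szemerédi Theorem and gives no lower bound on class sizes. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def triangle_conflict_graph(triangles):
    """Adjacency sets of H: triangles i and j are adjacent iff they share a vertex.

    Also returns t_max, the largest number of triangles through one vertex.
    """
    by_vertex = defaultdict(list)
    for idx, tri in enumerate(triangles):
        for v in tri:
            by_vertex[v].append(idx)
    adj = {idx: set() for idx in range(len(triangles))}
    for members in by_vertex.values():
        for i, j in combinations(members, 2):
            adj[i].add(j)
            adj[j].add(i)
    t_max = max((len(m) for m in by_vertex.values()), default=0)
    return adj, t_max


def greedy_independent_classes(adj):
    """Greedy proper coloring of H (NOT equitable); each color class is a
    set of pairwise vertex-disjoint triangles."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    classes = defaultdict(list)
    for v, c in color.items():
        classes[c].append(v)
    return list(classes.values())
```

On the four triangles of $K_4$, every pair shares a vertex, so $H$ is itself $K_4$, $t_{\max} = 3$, and each class contains a single triangle.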

3.3. Complexity. The running time of our procedure of course depends on the
subroutine we use in the second step, i.e., to count triangles in the edge set $E'$.
Assuming we use an exact method that examines each vertex independently and
counts the number of edges among its neighbors (a.k.a. the Node Iterator method [19]),
our algorithm runs in $O(n + m + p^2 \sum_{i \in [n]} \deg(i)^2)$ expected time by efficiently
storing the graph and retrieving the neighbors of $v$ colored with the same color as
$v$ in $O(1 + p \deg(v))$ expected time. Note that this implies that the speedup with
respect to the counting task is $1/p^2$.
3.4. Discussion. The use of Hajnal-Szemeredi Theorem in the context of proving 
concentration is not new, e.g., llTj. Despite the fact that the second moment 
argument gave us strong conditions on $p$, the use of Hajnal-Szemerédi has the
potential of improving the $\Delta$ factor. The condition we provide on $p$ is sufficient to obtain
concentration. Note (see Figure 1) that it was necessary to partition the triangles
into vertex-disjoint rather than edge-disjoint triangles since we need mutually
independent variables per chromatic class in order to apply the Chernoff bound. Were
we able to remove the dependencies in the chromatic classes defined by edge-disjoint
triangles, the overall result could probably be improved. It's worth noting that for
$p = 1$ we obtain that $t \ge n\,\omega(n)$, where $\omega(n)$ is any slowly growing function of $n$.
This is, to the best of our knowledge, the mildest condition on the triangle density
needed for a randomized algorithm to obtain concentration.

Furthermore, the powerful theorem of Kim and Vu [T^l [21] that was used in 
previous work [22] is not immediately applicable here: let $Y_e$ be an indicator variable
for each edge $e$ such that $Y_e = 1$ if and only if $e$ is monochromatic, i.e., both its
endpoints receive the same color. Note that the number of triangles is a boolean
polynomial $T = \frac{1}{3} \sum_{\Delta(e,f,g)} (Y_e Y_f + Y_f Y_g + Y_e Y_g)$ but the boolean variables are


Footnote: We assume that uniform sampling of a color takes constant time. If not, then we obtain the
term $O(n \log(1/p))$ for the vertex coloring procedure.

Figure 1. Consider the indicator variable $X_i$ corresponding to the
$i$-th triangle. Note that $\Pr[X_i \mid \text{rest are monochromatic}] = p \ne \Pr[X_i] = p^2$.
The indicator variables are pairwise but not mutually
independent.

not independent as the Kim-Vu [T^ theorem requires. It's worth noting that the 
degree of the polynomial is two. Essentially, this is the reason for which our method 
obtains better results than existing work [22] where the degree of the multivariate 
polynomial is three fiGl [22] . It's worth noting that previous work p!7l [22] sampled 
edges independently whereas our new method samples subsets of vertices but in 
a careful manner in order to decrease the degree of the multivariate polynomial. 
Finally, it's worth noting that using a simple doubling procedure [22] and the median 
boosting trick of Jerrum, Valiant and Vazirani [13] we can pick p effectively in 
practice despite the fact that it depends on the quantity t which we want to estimate 
by introducing an extra logarithm in the running time. 

Finally, from an experimentation point of view, it's interesting to see how well the
upper bound $3\Delta t$ matches the sum $\sum_{e \in E(G)} \delta_e^2$ and what the typical values for $\Delta$ and $t_{\max}$
are in real-world networks. The following table shows these numbers for five networks
taken from the SNAP library [1]. We see that $\Delta$ and $t_{\max}$ are significantly less than
their upper bounds and that typically $3\Delta t$ is significantly larger than $\sum_{e \in E(G)} \delta_e^2$
except for the collaboration network of Arxiv Astro Physics. The results are shown
in Table 1.



Name       Nodes    Edges  Triangle Count     Δ   t_max       Σ_e δ_e²            3Δt
AS         7,716   12,572           6,584   344   2,047        595,632      6,794,688
Oregon    11,492   23,409          19,894   537   3,638      2,347,560     32,049,234
Enron     36,692  183,831         727,044   420  17,744     75,237,684    916,075,440
ca-HepPh  12,008  118,489       3,358,499   450  39,633  1.8839 × 10^9   4.534 × 10^9
AstroPh   18,772  198,050       1,351,441   350  11,269    148,765,753   1.419 × 10^9

Table 1. Values for the variables involved in our formulae for five
real-world networks.



AS: Autonomous Systems, Oregon: Oregon route views, Enron: Email communication network, 
ca-HepPh and AstroPh:Collaboration networks. Self-edges were removed. 
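
The last column of Table 1 can be recomputed from the triangle count and $\Delta$. The short Python check below does this for the four rows whose entries are given exactly (ca-HepPh is omitted because its entries are rounded); `TABLE1` and `bound_ratio` are illustrative names.

```python
# (name, triangles t, Delta, sum over edges of delta_e^2) as listed in Table 1
TABLE1 = [
    ("AS", 6_584, 344, 595_632),
    ("Oregon", 19_894, 537, 2_347_560),
    ("Enron", 727_044, 420, 75_237_684),
    ("AstroPh", 1_351_441, 350, 148_765_753),
]


def bound_ratio(t, delta, sum_sq):
    """Return the upper bound 3*Delta*t and how many times it exceeds sum_sq."""
    upper = 3 * delta * t
    return upper, upper / sum_sq


for name, t, delta, sum_sq in TABLE1:
    upper, ratio = bound_ratio(t, delta, sum_sq)
    print(f"{name}: 3*Delta*t = {upper:,} (about {ratio:.1f}x the sum)")
```

For these four rows the bound $3\Delta t$ exceeds $\sum_e \delta_e^2$ by roughly an order of magnitude.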

Algorithm 2 MapReduce Colorful Triangle Counting $G(V,E)$, $p = 1/N$

Map: Input $\langle e = (u, f(u), v, f(v)); 1 \rangle$ {Let $f$ be a uniformly at random coloring
of the vertices with $N$ colors}

if $f(u) = f(v)$ then emit $\langle f(u); (u,v) \rangle$

Reduce: Input $\langle c; E_c = \{(u,v)\} \subseteq E \rangle$ {Every edge $(u,v) \in E_c$ has color $c$, i.e.,
$f(u) = f(v)$}

Scale each triangle by $1/p^2$.

4. A MapReduce Implementation 

MapReduce [10] has become the de facto standard in academia and industry for
analyzing large scale networks. Recent work by Suri and Vassilvitskii [20] proposes
two algorithms for counting triangles. The first is an efficient MapReduce
implementation of the Node Iterator algorithm (see also [19]), and the second is based on
partitioning the graph into overlapping subsets so that each triangle is present in at
least one of the subsets.

Our method is amenable to being implemented in MapReduce and the skeleton
of such an implementation is shown in Algorithm 2. We implicitly assume that in a
first round vertices have received a color uniformly at random from the $N$ available
colors and that we have the coloring information for the endpoints of each edge.
Each mapper receives an edge together with the colors of its endpoints. If the edge
is monochromatic, then it is emitted with the color as the key and the edge as the
value. Edges with the same color are shipped to the same reducer where locally
a triangle counting algorithm is applied. The total count is scaled appropriately.
Trivially, the following lemma holds by the linearity of expectation and the fact that
the endpoints of any edge receive a given color $c$ with probability $p^2$.

Lemma 2. The expected size of the input to any reduce instance is $O(p^2 m)$ and the expected
total space used at the end of the map phase is $O(pm)$.
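
The map, shuffle and reduce phases can be imitated in memory with plain Python. This is a single-machine sketch, not a MapReduce job: the `groups` dictionary plays the role of the shuffle, the function names are hypothetical, and we assume a simple graph (no self-loops or duplicate edges) with a precomputed coloring `color`.

```python
from collections import defaultdict


def map_edge(u, cu, v, cv):
    """Map: emit (color, edge) only for monochromatic edges."""
    if cu == cv:
        yield cu, (u, v)


def reduce_color(edges):
    """Reduce: count the triangles among the edges of one color class."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Each triangle is found once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def colorful_mapreduce(edges, color, num_colors):
    """Run map, group by key, reduce, then sum and scale by 1/p^2 = num_colors^2."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for u, v in edges:
        for key, edge in map_edge(u, color[u], v, color[v]):
            groups[key].append(edge)
    mono_triangles = sum(reduce_color(es) for es in groups.values())
    return mono_triangles * num_colors ** 2
```

Summing the per-color counts first and scaling once at the end mirrors the two-round variant in which reducers emit raw monochromatic triangle counts.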

5. Conclusions 

In this note we introduced a new randomized algorithm for approximate triangle 
counting, which is implemented easily in parallel. We showed such an implemen- 
tation in the popular MapReduce programming framework. The key idea which 
improves the existing work is that by our new sampling method the degree of the mul- 
tivariate polynomial expressing the number of triangles decreases by one, compared 
to previous work, e.g., [HI |22]. We used the powerful result of Hajnal-Szemeredi 
Theorem to obtain a concentration result which is unlikely to be the best possible. 
We observe that our result extends to any subset of triangles satisfying some predicate
(e.g., containing a certain vertex), in the sense that counting such triangles in the 
sample leads to a concentrated estimate of the number in the original graph. 

In future work we plan to investigate sampling methods for counting triangles in 
weighted graphs, other types of subgraphs and several systems-oriented aspects of 
our work. 

Footnote: It's worth pointing out, for completeness, that in practice one would not scale the
triangles after the first reduce. Instead, each reducer would emit its count of monochromatic
triangles; these counts would be summed up in a second round and scaled by $1/p^2$.